output coordinate
Principles of Riemannian Geometry in Neural Networks
This study deals with neural networks in the sense of geometric transformations acting on the coordinate representation of the underlying data manifold which the data is sampled from. It forms part of an attempt to construct a formalized general theory of neural networks in the setting of Riemannian geometry. From this perspective, the following theoretical results are developed and proven for feedforward networks. First it is shown that residual neural networks are nite dierence approximations to dynamical systems of rst order dierential equations, as opposed to ordinary networks that are static. This implies that the network is learning systems of dierential equations governing the coordinate transformations that represent the data. Second it is shown that a closed form solution of the metric tensor on the underlying data manifold can be found by backpropagating the coordinate representations learned by the neural network itself. This is formulated in a formal abstract sense as a sequence of Lie group actions on the metric bre space in the principal and associated bundles on the data manifold. Toy experiments were run to conrm parts of the proposed theory, as well as to provide intuitions as to how neural networks operate on data.
Accelerating Sparse Convolutions in Voxel-Based Point Cloud Networks
Adamopoulos, Dionysios, Poulopoulou, Anastasia, Goumas, Georgios, Giannoula, Christina
Sparse Convolution (SpC) powers 3D point cloud networks widely used in autonomous driving and AR/VR. SpC builds a kernel map that stores mappings between input voxel coordinates, output coordinates, and weight offsets, then uses this map to compute feature vectors for output coordinates. Our work identifies three key properties of voxel coordinates: they are integer-valued, bounded within a limited spatial range, and geometrically continuous-neighboring voxels on the same object surface are highly likely to exist at small spatial offsets from each other. Prior SpC engines do not fully exploit these properties and suffer from high pre-processing and post-processing overheads during kernel map construction. To address this, we design Spira, the first voxel-property-aware SpC engine for GPUs. Spira proposes: (i) a high-performance one-shot search algorithm that builds the kernel map with no preprocessing and high memory locality, (ii) an effective packed-native processing scheme that accesses packed voxel coordinates at low cost, (iii) a flexible dual-dataflow execution mechanism that efficiently computes output feature vectors by adapting to layer characteristics, and (iv) a network-wide parallelization strategy that builds kernel maps for all SpC layers concurrently at network start. Our evaluation shows that Spira significantly outperforms prior SpC engines by 1.71x on average and up to 2.31x for end-to-end inference, and by 2.13x on average and up to 3.32x for layer-wise execution across diverse layer configurations.
Ehrenfeucht-Haussler Rank and Chain of Thought
Barceló, Pablo, Kozachinskiy, Alexander, Steifer, Tomasz
The notion of rank of a Boolean function has been a cornerstone in the theory of PAC learning, enabling quasipolynomial-time learning algorithms for polynomial-size decision trees. We present a novel characterization of rank, grounded in the well-known Transformer architecture. We show that the rank of a function $f$ corresponds to the minimum number of Chain of Thought (CoT) steps required by a single-layer transformer decoder with hard attention to compute $f$. Based on this characterization we establish tight bounds on the number of CoT steps required for specific problems, showing that $\ell$-fold function composition necessitates exactly $\ell$ CoT steps. Furthermore, we analyze the problem of identifying the position of the $k$-th occurrence of 1 in a Boolean sequence, proving that it requires $k$ CoT steps.
Minuet: Accelerating 3D Sparse Convolutions on GPUs
Yang, Jiacheng, Giannoula, Christina, Wu, Jun, Elhoushi, Mostafa, Gleeson, James, Pekhimenko, Gennady
Sparse Convolution (SC) is widely used for processing 3D point clouds that are inherently sparse. Different from dense convolution, SC preserves the sparsity of the input point cloud by only allowing outputs to specific locations. To efficiently compute SC, prior SC engines first use hash tables to build a kernel map that stores the necessary General Matrix Multiplication (GEMM) operations to be executed (Map step), and then use a Gather-GEMM-Scatter process to execute these GEMM operations (GMaS step). In this work, we analyze the shortcomings of prior state-of-the-art SC engines, and propose Minuet, a novel memory-efficient SC engine tailored for modern GPUs. Minuet proposes to (i) replace the hash tables used in the Map step with a novel segmented sorting double-traversed binary search algorithm that highly utilizes the on-chip memory hierarchy of GPUs, (ii) use a lightweight scheme to autotune the tile size in the Gather and Scatter operations of the GMaS step, such that to adapt the execution to the particular characteristics of each SC layer, dataset, and GPU architecture, and (iii) employ a padding-efficient GEMM grouping approach that reduces both memory padding and kernel launching overheads. Our evaluations show that Minuet significantly outperforms prior SC engines by on average $1.74\times$ (up to $2.22\times$) for end-to-end point cloud network executions. Our novel segmented sorting double-traversed binary search algorithm achieves superior speedups by $15.8\times$ on average (up to $26.8\times$) over prior SC engines in the Map step. The source code of Minuet is publicly available at https://github.com/UofT-EcoSystem/Minuet.
Minimax Optimality (Probably) Doesn't Imply Distribution Learning for GANs
Chen, Sitan, Li, Jerry, Li, Yuanzhi, Meka, Raghu
Arguably the most fundamental question in the theory of generative adversarial networks (GANs) is to understand to what extent GANs can actually learn the underlying distribution. Theoretical and empirical evidence suggests local optimality of the empirical training objective is insufficient. Yet, it does not rule out the possibility that achieving a true population minimax optimal solution might imply distribution learning. In this paper, we show that standard cryptographic assumptions imply that this stronger condition is still insufficient. Namely, we show that if local pseudorandom generators (PRGs) exist, then for a large family of natural continuous target distributions, there are ReLU network generators of constant depth and polynomial size which take Gaussian random seeds so that (i) the output is far in Wasserstein distance from the target distribution, but (ii) no polynomially large Lipschitz discriminator ReLU network can detect this. This implies that even achieving a population minimax optimal solution to the Wasserstein GAN objective is likely insufficient for distribution learning in the usual statistical sense. Our techniques reveal a deep connection between GANs and PRGs, which we believe will lead to further insights into the computational landscape of GANs.